
Add YOLO26 object detection contrib model#151

Open
jimburtoft wants to merge 3 commits into aws-neuron:main from jimburtoft:contrib/yolo26

Conversation

@jimburtoft
Contributor

Summary

  • Adds Ultralytics YOLO26 object detection models (n/s/m/l/x, 2.4-58.9M params) as a contrib model for real-time inference on AWS Trainium2 and Inferentia2 via torch_neuronx.trace()
  • All 5 detection variants compile and run with high accuracy (CosSim 0.987-0.997), plus pose and OBB task heads
  • Neuron outperforms compiled A10G GPU by 1.4-4.5x on s/m/l/x variants at peak DP throughput
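The compile path described above goes through torch_neuronx.trace(). A minimal sketch of what that typically looks like — illustrative only, not the PR's actual src/modeling_yolo26.py; it assumes a Trn2/Inf2 instance with the torch-neuronx and ultralytics packages installed, and that "yolo26n.pt" resolves to the nano checkpoint:

```python
# Illustrative sketch, not the PR's implementation (see src/modeling_yolo26.py).
# Requires Neuron hardware plus the torch-neuronx and ultralytics packages.
import torch
import torch_neuronx
from ultralytics import YOLO

model = YOLO("yolo26n.pt").model.eval()   # underlying detection nn.Module
example = torch.zeros(1, 3, 640, 640)     # static input shape required for tracing
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--lnc", "1"],         # LNC=1 mode, per the design notes below
)
torch.jit.save(neuron_model, "yolo26n_neuron.pt")
```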

Validation

Validated on 4 configurations: trn2.3xlarge × {SDK 2.28, 2.29} and inf2.xlarge × {SDK 2.28, 2.29}.

Instance       SDK    Tests            yolo26n CosSim   yolo26s CosSim   yolo26n img/s   yolo26s img/s
trn2.3xlarge   2.28   13/13 pytest     0.9943           0.9931           32.3            66.0
trn2.3xlarge   2.29   13/13 pytest     0.9941           0.9931           33.2            65.5
inf2.xlarge    2.28   6/6 standalone   0.9965           0.9931           60.1            64.1
inf2.xlarge    2.29   6/6 standalone   0.9965           0.9931           69.7            76.7
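The CosSim figures compare the Neuron output tensor against the CPU reference, flattened to vectors. A minimal sketch of the metric (helper name is illustrative, not the PR's code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flattened vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Values near 1.0 (e.g. the 0.993-0.997 range above) indicate the compiled model closely matches the reference.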

Peak Throughput (trn2.3xlarge, LNC=1, DP=8)

Variant   Params   Dtype   img/s   vs A10G Compiled
YOLO26n   2.4M     FP32      272   0.13x
YOLO26s   10.0M    FP32    1,523   1.43x
YOLO26m   21.9M    BF16    1,267   2.67x
YOLO26l   26.3M    BF16    1,093   2.95x
YOLO26x   58.9M    BF16      876   4.49x
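For context, the DP=8 aggregate figures imply approximate per-NeuronCore rates and A10G baselines. This is simple arithmetic on the table above, not separately measured data:

```python
# Back-of-envelope arithmetic on the DP=8 aggregate figures above.
# Derived numbers are implied, not separately measured.
DP = 8
rows = {  # variant: (aggregate img/s, speedup vs compiled A10G)
    "YOLO26s": (1523, 1.43),
    "YOLO26m": (1267, 2.67),
    "YOLO26l": (1093, 2.95),
    "YOLO26x": (876, 4.49),
}
for name, (imgs, speedup) in rows.items():
    per_core = imgs / DP    # implied per-NeuronCore rate
    a10g = imgs / speedup   # implied compiled-A10G baseline
    print(f"{name}: ~{per_core:.0f} img/s per core, A10G ~{a10g:.0f} img/s")
```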

Files

contrib/models/YOLO26/
  README.md                          # Model card, benchmarks, compatibility matrix
  yolo26_neuron_notebook.ipynb       # Complete workflow notebook (tested end-to-end)
  src/
    __init__.py                      # Exports: YOLO26NeuronModel, compile_yolo26, etc.
    modeling_yolo26.py               # Trace wrapper, DP support, validation (~280 lines)
  test/
    __init__.py
    unit/__init__.py
    integration/
      __init__.py
      test_model.py                  # 13 integration tests (compile, accuracy, DP, perf)

Key Design Decisions

  • torch_neuronx.trace() (not NxDI model classes): YOLO26 is a CNN with no KV cache, no attention matrices, no token generation. All variants fit on a single NeuronCore (<180 MB NEFF). Data Parallelism provides throughput scaling.
  • end2end=False: topk/sort operations are not supported on Neuron (NCC_EVRF029). Raw [B, 84, 8400] output with CPU postprocessing (~0.1ms overhead).
  • BF16 for m/l/x: FP32 exceeds SB allocation for larger variants (NCC_IGCA030). n/s use FP32.
  • No --auto-cast flags: matmul autocast produces NaN outputs for Conv2d-dominant models.
  • LNC-aware compilation: the --lnc 1 compiler flag is required when running in LNC=1 mode.
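Because end2end=False moves topk/sort off-device, the host decodes the raw [B, 84, 8400] tensor (4 box coordinates plus 80 class scores per anchor). A hypothetical sketch of that CPU postprocessing step, using plain Python lists in place of tensors — function and threshold names are illustrative, not the PR's code:

```python
# Hypothetical sketch of the CPU-side postprocessing for the raw
# [84, N] detection output (4 box coords + 80 class scores per anchor).
# topk/sort are unsupported on Neuron (NCC_EVRF029), so filtering and
# ranking run on the host.
def decode_detections(raw, conf_thres=0.25):
    """raw: 84 rows, each a list of N per-anchor values.
    Returns (box, class_id, score) for anchors above conf_thres."""
    boxes, scores = raw[:4], raw[4:]
    n = len(boxes[0])
    detections = []
    for i in range(n):
        # best class score for this anchor
        cls_id, score = max(enumerate(col[i] for col in scores),
                            key=lambda t: t[1])
        if score >= conf_thres:
            box = [boxes[k][i] for k in range(4)]
            detections.append((box, cls_id, score))
    # rank by confidence -- the sort Neuron cannot do on-device
    detections.sort(key=lambda d: -d[2])
    return detections
```

On real outputs this loop vectorizes trivially with torch/numpy, which is consistent with the ~0.1 ms overhead cited above.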

Target

aws-neuron/neuronx-distributed-inference main branch.

Ultralytics YOLO26 (n/s/m/l/x) on Trainium2 via torch_neuronx.trace().
All 5 detection variants plus pose and OBB task heads compile and run
with high accuracy (CosSim 0.987-0.997).

Peak throughput on trn2.3xlarge (LNC=1, DP=8):
- YOLO26s: 1,523 img/s (1.43x vs A10G compiled)
- YOLO26m: 1,267 img/s (2.67x vs A10G compiled)
- YOLO26l: 1,093 img/s (2.95x vs A10G compiled)
- YOLO26x:   876 img/s (4.49x vs A10G compiled)

Includes modeling module, 13 integration tests (all passing),
Jupyter notebook, and README with benchmarks.
Tested all 4 combinations:
- trn2.3xlarge SDK 2.28: 13/13 pytest passed
- trn2.3xlarge SDK 2.29: 13/13 pytest passed
- inf2.xlarge SDK 2.28: 6/6 standalone tests passed
- inf2.xlarge SDK 2.29: 6/6 standalone tests passed

inf2 single-core throughput: yolo26n 60-70 img/s, yolo26s 64-77 img/s.
Updated compatibility matrix and notebook prerequisites.
